AGGREGATION BY EXPONENTIAL WEIGHTING, SHARP 
ORACLE INEQUALITIES AND SPARSITY 



A. DALALYAN AND A.B. TSYBAKOV

Abstract. We study the problem of aggregation under the squared loss in the 
model of regression with deterministic design. We obtain sharp PAC-Bayesian 
risk bounds for aggregates defined via exponential weights, under general as- 
sumptions on the distribution of errors and on the functions to aggregate. We 
then apply these results to derive sparsity oracle inequalities. 



1. Introduction 

Aggregation with exponential weights is an important tool in machine learning. 
It is used for estimation, prediction with expert advice, in PAC-Bayesian settings 
and other problems. In this paper we establish a link between aggregation with 
exponential weights and sparsity. More specifically, we obtain a new type of oracle 
inequalities and apply them to show that the exponential weighted aggregate with 
a suitably chosen prior has a sparsity property. 

We consider the regression model 

(1)    $Y_i = f(x_i) + \xi_i, \quad i = 1, \dots, n,$

where $x_1, \dots, x_n$ are given non-random elements of a set $\mathcal{X}$, $f : \mathcal{X} \to \mathbb{R}$ is an
unknown function, and $\xi_i$ are i.i.d. zero-mean random variables on a probability
space $(\Omega, \mathcal{F}, P)$ where $\Omega \subseteq \mathbb{R}$. The problem is to estimate the function $f$ from the
data $D_n = ((x_1, Y_1), \dots, (x_n, Y_n))$.

Let $(\Lambda, \mathcal{A})$ be a measurable space and denote by $\mathcal{P}_\Lambda$ the set of all probability
measures defined on $(\Lambda, \mathcal{A})$. Assume that we are given a family $\{f_\lambda, \lambda \in \Lambda\}$ of
functions $f_\lambda : \mathcal{X} \to \mathbb{R}$ such that the mapping $\lambda \mapsto f_\lambda(x)$ is measurable for all
$x \in \mathcal{X}$, where $\mathbb{R}$ is equipped with the Borel $\sigma$-field. Functions $f_\lambda$ can be viewed
either as weak learners or as some preliminary estimators of $f$ based on a training
sample independent of $Y = (Y_1, \dots, Y_n)$ and considered as frozen.

We study the problem of aggregation of functions in $\{f_\lambda, \lambda \in \Lambda\}$ under the
squared loss. The aim of aggregation is to construct an estimator $\hat f_n$ based on the
data $D_n$ and called the aggregate such that the expected value of its squared error

       $\|\hat f_n - f\|_n^2 = \frac{1}{n} \sum_{i=1}^n \big(\hat f_n(x_i) - f(x_i)\big)^2$

is approximately as small as the oracle value $\inf_{\lambda \in \Lambda} \|f - f_\lambda\|_n^2$.
In this paper we consider aggregates that are mixtures of functions $f_\lambda$ with
exponential weights. For a measure $\pi$ from $\mathcal{P}_\Lambda$ and for $\beta > 0$ we set

(2)    $\hat f_n(x) = \int_\Lambda \theta_\lambda(Y) f_\lambda(x)\, \pi(d\lambda), \quad x \in \mathcal{X},$

with

(3)    $\theta_\lambda(Y) = \dfrac{\exp\{-n\|Y - f_\lambda\|_n^2/\beta\}}{\int_\Lambda \exp\{-n\|Y - f_w\|_n^2/\beta\}\, \pi(dw)}$

where $\|Y - f_\lambda\|_n^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - f_\lambda(x_i))^2$ and we assume that $\pi$ is such that the
integral in (2) is finite.

Note that $\theta_\lambda(Y) = \theta_\lambda(\beta, \pi, Y)$, so that $\hat f_n$ depends on two tuning parameters:
the probability measure $\pi$ and the "temperature" parameter $\beta$. They have to be
selected in a suitable way.

Using the Bayesian terminology, $\pi(\cdot)$ is a prior distribution and $\hat f_n$ is the posterior
mean of $f_\lambda$ in a "phantom" model

(4)    $Y_i = f_\lambda(x_i) + \xi_i'$

where $\xi_i'$ are i.i.d. normally distributed random variables with mean 0 and variance
$\beta/2$.
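
For a finite set $\Lambda = \{1, \dots, M\}$ the definitions (2)-(3) reduce to a weighted average of the rows of a matrix of function values. The following is a minimal numerical sketch, not part of the original paper; the names `exp_weights` and `aggregate`, the array layout `F[j, i] = f_j(x_i)`, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def exp_weights(Y, F, prior, beta):
    """Weights theta_j proportional to prior_j * exp(-n * ||Y - f_j||_n^2 / beta), as in (3).

    Y: (n,) responses; F: (M, n) array with F[j, i] = f_j(x_i); prior: (M,) probabilities.
    """
    n = Y.shape[0]
    sq_err = np.mean((Y[None, :] - F) ** 2, axis=1)  # ||Y - f_j||_n^2 for each j
    log_w = np.log(prior) - n * sq_err / beta
    log_w -= log_w.max()                             # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def aggregate(Y, F, prior, beta):
    """Values of the aggregate (2) at the design points x_1, ..., x_n."""
    return exp_weights(Y, F, prior, beta) @ F
```

A large $\beta$ keeps the weights close to the prior, while a small $\beta$ concentrates them on the functions with the smallest empirical squared error.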

The idea of mixing with exponential weights has been discussed by many authors,
apparently since the 1970s (see [38] for an overview of the subject). Most of
the work has focused on the important particular case where the set of
estimators is finite, i.e., w.l.o.g. $\Lambda = \{1, \dots, M\}$, and the distribution $\pi$ is uniform
on $\Lambda$. Procedures of the type (2)-(3) with general sets $\Lambda$ and priors $\pi$ came into
consideration quite recently [9[ [10l [36l 0J |4TJ [42j [TJ [2] , partly in connection with
the PAC-Bayesian approach. For finite $\Lambda$, procedures (2)-(3) were independently
introduced for prediction of deterministic individual sequences with expert advice.
Representative work and references can be found in [35l [26l [Til E21 H3]; in this
framework the results are proved for cumulative loss and no assumption is made
on the statistical nature of the data, whereas the observations $Y_i$ are supposed to
be uniformly bounded by a known constant.

We also mention related work on cumulative exponential weighting methods:
there the aggregate is defined as the average $n^{-1} \sum_{k=1}^n \hat f_k$. For regression models
with random design, such procedures are introduced and analyzed in [9], pjj] and
[57] . In particular, [S] and [TU] establish a sharp oracle inequality, i.e., an inequality
with leading constant 1. This result is further refined in [4] and [20]. In addition,
[20] derives sharp oracle inequalities not only for the squared loss but also for general
loss functions. However, these techniques are not helpful in the framework that we
consider here, because the averaging device is not meaningfully adapted to models
with non-identically distributed observations.

For finite $\Lambda$, the aggregate $\hat f_n$ can be computed on-line. This, in particular,
motivated its use for on-line prediction. Papers [50], [5JJ point out that $\hat f_n$ and
its averaged version can be obtained as a special case of mirror descent algorithms
that were considered earlier in deterministic minimization. Finally, [12, 20] establish
some links between the results for cumulative risks proved in the theory of prediction
of deterministic sequences and generalization error bounds for the aggregates in the
stochastic i.i.d. case.
In this paper we obtain sharp oracle inequalities for the aggregate $\hat f_n$ under the
squared loss, i.e., oracle inequalities with leading constant 1 and optimal rate of
the remainder term. Such an inequality was pioneered in [25] in a somewhat
different setting. Namely, it is assumed in [25] that $\Lambda$ is a finite set, the errors $\xi_i$ are
Gaussian and $f_\lambda$ are estimators constructed from the same sample $D_n$ and satisfying
some strong restrictions (essentially, these should be the projection estimators).
The result of [25] makes use of Stein's unbiased risk formula, and gives a very precise
constant in the remainder term of the inequality. Inspection of the argument in
[25] shows that it can also be applied in the following special case of our setting:
$f_\lambda$ are arbitrary fixed functions, $\Lambda$ is a finite set and the errors $\xi_i$ are Gaussian.

The general line of our argument is to establish some PAC-Bayesian risk bounds
(cf. (8), (10)) and then to derive sharp oracle inequalities by making proper choices
of the probability measure $p$ involved in those bounds (cf. Sections 5, 7).

The main technical effort is devoted to the proof of the PAC-Bayesian bounds
(Sections 3, 4, 6). The results are valid for general $\Lambda$ and arbitrary functions
$f_\lambda$ satisfying some mild conditions. Furthermore, we treat non-Gaussian errors
$\xi_i$. For this purpose, we suggest three different approaches to prove the
PAC-Bayesian bounds. The first one is based on an integration by parts technique that
generalizes Stein's unbiased risk formula (Section 3). It is close in spirit to
[25]. This approach leads to the most accurate results but it covers only a narrow
class of distributions of the errors $\xi_i$. In Section 4 we introduce another technique
based on dummy randomization which allows us to obtain sharp risk bounds when
the distributions of errors $\xi_i$ are n-divisible. Finally, the third approach (Section
6) invokes the Skorokhod embedding and covers the class of all symmetric error
distributions with finite moments of order larger than or equal to 2. Here the price
to pay for the generality of the distribution of errors is in the rate of convergence
that becomes slower if only smaller moments are finite.

In Section 7 we analyze our risk bounds in the important special case where $f_\lambda$
is a linear combination of $M$ known functions $\phi_1, \dots, \phi_M$ with the vector of weights
$\lambda = (\lambda_1, \dots, \lambda_M)$: $f_\lambda = \sum_{j=1}^M \lambda_j \phi_j$. This setting is connected with the following
three problems.

1. High-dimensional linear regression. Assume that the regression function has
the form $f = f_{\lambda^*}$ where $\lambda^* \in \mathbb{R}^M$ is an unknown vector, in other words we have
a linear regression model. In recent years a great deal of attention has been
focused on estimation in such a linear model where the number of variables $M$ is
much larger than the sample size $n$. The idea is that the effective dimension of the
model is defined not by the number of potential parameters $M$ but by the unknown
number of non-zero components $M(\lambda^*)$ of the vector $\lambda^*$ that can be much smaller
than $n$. In this situation methods like Lasso, LARS or Dantzig selector are used
[T71 [8]. It is proved that if $M(\lambda^*) \ll n$ and if the dictionary $\{\phi_1, \dots, \phi_M\}$ satisfies
certain conditions, then the vector $\lambda^*$ and the function $f$ can be estimated with
reasonable accuracy [HI El El HOI [3] . However, the conditions on the dictionary
$\{\phi_1, \dots, \phi_M\}$ required to get risk bounds for the Lasso and Dantzig selector are
quite restrictive. One of the consequences of our results in Section 7 is that a
suitably defined aggregate with exponential weights attains essentially the same
and sometimes even better behavior than the Lasso or Dantzig selector with no
assumption on the dictionary, except for the standard normalization.
2. Adaptive nonparametric regression. Assume that $f$ is a smooth function, and
$\{\phi_1, \dots, \phi_M\}$ are the first $M$ functions from a basis in $L_2(\mathbb{R}^d)$. If the basis is
orthonormal, it is well-known that adaptive estimators of $f$ can be constructed in
the form $\sum_{j=1}^M \hat\lambda_j \phi_j$ where $\hat\lambda_j$ are appropriately chosen data-driven coefficients and
$M$ is a suitably selected integer such that $M \le n$ (cf., e.g., [27, 32]). Our aggregation
procedure suggests a more general way to treat adaptation covering the problems
where the system $\{\phi_j\}$ is not necessarily orthonormal, even not necessarily a basis,
and $M$ is not necessarily smaller than $n$. In particular, the situation where $M \gg n$
arises if we want to deal with sparse functions $f$ that have very few non-zero scalar
products with functions from the dictionary $\{\phi_j\}$, but these non-zero coefficients
can correspond to very high "harmonics". The results of Section 7 cover this case.

3. Linear, convex or model selection type aggregation. Assume now that $\phi_1, \dots, \phi_M$
are either some preliminary estimators of $f$ constructed from a training sample
independent of $(Y_1, \dots, Y_n)$ or some weak learners, and our aim is to construct an
aggregate which is approximately as good as the best among $\phi_1, \dots, \phi_M$ or
approximately as good as the best linear or convex combination of $\phi_1, \dots, \phi_M$. In
other words, we deal with the problems of model selection (MS) type aggregation
or linear/convex aggregation respectively J37J 131] . It is shown in [5] that a BIC type
aggregate achieves optimal rates simultaneously for MS, linear and convex
aggregation. This result is deduced in [6] from a sparsity oracle inequality (SOI), i.e., from
an oracle inequality stated in terms of the number $M(\lambda)$ of non-zero components
of $\lambda$. For a discussion of the concept of SOI we refer to [33]. Examples of SOI are
proved in [231 El [3H [3] for the Lasso, BIC and Dantzig selector aggregates.
Note that the SOI for the Lasso and Dantzig selector are not as strong as those for
the BIC: they fail to guarantee optimal rates for MS, linear and convex
aggregation unless $\phi_1, \dots, \phi_M$ satisfy some very restrictive conditions. On the other hand,
the BIC aggregate is computationally feasible only for very small dimensions $M$.
So, neither of these methods achieves both the computational efficiency and the
optimal theoretical performance.

In Section 7 we propose a new approach to sparse recovery that realizes a
compromise between the theoretical properties and the computational efficiency. We
first suggest a general technique of deriving SOI from the PAC-Bayesian bounds,
not necessarily for our particular aggregate $\hat f_n$. We then show that the
exponentially weighted aggregate $\hat f_n$ with an appropriate prior measure $\pi$ satisfies a sharp
SOI, i.e., a SOI with leading constant 1. Its theoretical performance is comparable
with that of the BIC in terms of sparsity oracle inequalities for the prediction risk.
No assumption on the dictionary $\phi_1, \dots, \phi_M$ is required, except for the standard
normalization. Even more, the result is sharper than the best available SOI for
the BIC-type aggregate [6], since the leading constant in the oracle inequality of
[6] is strictly greater than 1. At the same time, similarly to the Lasso and Dantzig
selector, our method is computationally feasible for moderately large dimensions $M$.

2. Some notation 

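In what follows we will often write for brevity $\theta_\lambda$ instead of $\theta_\lambda(Y)$. For any
vector $z = (z_1, \dots, z_n)^T \in \mathbb{R}^n$ set

       $\|z\|_n = \Big(\frac{1}{n} \sum_{i=1}^n z_i^2\Big)^{1/2}.$

Denote by $\mathcal{P}'_\Lambda$ the set of all measures $\mu \in \mathcal{P}_\Lambda$ such that $\lambda \mapsto f_\lambda(x)$ is integrable
w.r.t. $\mu$ for $x \in \{x_1, \dots, x_n\}$. Clearly $\mathcal{P}'_\Lambda$ is a convex subset of $\mathcal{P}_\Lambda$. For any
measure $\mu \in \mathcal{P}'_\Lambda$ we define

       $\bar f_\mu(x_i) = \int_\Lambda f_\lambda(x_i)\, \mu(d\lambda), \quad i = 1, \dots, n.$

We denote by $\theta \cdot \pi$ the probability measure $A \mapsto \int_A \theta_\lambda\, \pi(d\lambda)$ defined on $\mathcal{A}$. With
the above notation, we can write

       $\hat f_n = \bar f_{\theta \cdot \pi}.$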

3. A PAC-Bayesian bound based on unbiased risk estimation 

In this section we prove our first PAC-Bayesian bound. An important element
of the proof is an extension of Stein's identity which uses integration by parts. For
this purpose we introduce the function

       $m_\xi(x) = -E[\xi_1 1(\xi_1 \le x)] = -\int_{-\infty}^{x} z\, dF_\xi(z) = \int_{x}^{\infty} z\, dF_\xi(z),$

where $F_\xi(z) = P(\xi_1 \le z)$ is the c.d.f. of $\xi_1$, $1(\cdot)$ denotes the indicator function and
the last equality follows from the assumption $E(\xi_1) = 0$. Since $E|\xi_1| < \infty$ the
function $m_\xi$ is well defined, non-negative and satisfies $m_\xi(-\infty) = m_\xi(+\infty) = 0$.
Moreover, $m_\xi$ is increasing on $(-\infty, 0]$, decreasing on $[0, +\infty)$ and $\max_{x \in \mathbb{R}} m_\xi(x) =
m_\xi(0) = \frac{1}{2} E|\xi_1|$. We will need the following assumption.

(A) $E(\xi_1^2) = \sigma^2 < \infty$ and the measure $m_\xi(z)\, dz$ is absolutely continuous with
    respect to $dF_\xi(z)$ with a bounded Radon-Nikodym derivative, i.e., there exists
    a function $g_\xi : \mathbb{R} \to \mathbb{R}_+$ such that $\|g_\xi\|_\infty = \sup_{x \in \mathbb{R}} g_\xi(x) < \infty$ and

       $\int_a^{a'} m_\xi(z)\, dz = \int_a^{a'} g_\xi(z)\, dF_\xi(z), \quad \forall a, a' \in \mathbb{R}.$

Clearly, Assumption (A) is a restriction on the probability distribution of the errors
$\xi_i$. Some examples where Assumption (A) is fulfilled are:

(i) If $\xi_1 \sim \mathcal{N}(0, \sigma^2)$, then $g_\xi(x) \equiv \sigma^2$.

(ii) If $\xi_1$ is uniformly distributed in the interval $[-b, b]$, then $m_\xi(x) = (b^2 -
x^2)_+/(4b)$ and $g_\xi(x) = (b^2 - x^2)_+/2$.

(iii) If $\xi_1$ has a density function $f_\xi$ with compact support $[-b, b]$ and such that
$f_\xi(x) \ge f_{\min} > 0$ for every $x \in [-b, b]$, then Assumption (A) is satisfied with
$g_\xi(x) = m_\xi(x)/f_\xi(x) \le E|\xi_1|/(2 f_{\min})$.

We now give some examples where (A) is not fulfilled:

(iv) If $\xi_1$ has a double exponential distribution with zero mean and variance $\sigma^2$,
then $g_\xi(x) = (\sigma^2 + \sqrt{2\sigma^2}\,|x|)/2$, which is unbounded.

(v) If $\xi_1$ is a Rademacher random variable, then $m_\xi(x) = 1(|x| \le 1)/2$, and the
measure $m_\xi(x)\, dx$ is not absolutely continuous with respect to the
distribution of $\xi_1$.
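
Examples (i) and (ii) can be checked numerically from the definition $g_\xi = m_\xi / f_\xi$ when the error law has a density. The sketch below is an illustration only and not taken from the paper; the helper names `m_xi` and `g_xi`, the chosen values of `sigma` and `b`, and the truncation of the Gaussian integral at a large finite upper limit are all assumptions.

```python
import numpy as np

def m_xi(x, density, upper, num=200_001):
    """m_xi(x) = integral from x to `upper` of z * density(z) dz (trapezoidal rule)."""
    z = np.linspace(x, upper, num)
    y = z * density(z)
    dz = z[1] - z[0]
    return dz * (y.sum() - 0.5 * (y[0] + y[-1]))

def g_xi(x, density, upper):
    """Radon-Nikodym derivative g_xi(x) = m_xi(x) / density(x) when a density exists."""
    return m_xi(x, density, upper) / density(x)

sigma, b = 1.5, 2.0  # illustrative parameter values
gauss = lambda z: np.exp(-z**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
unif = lambda z: np.where(np.abs(z) <= b, 1.0 / (2 * b), 0.0)

g_gauss = g_xi(0.7, gauss, 12 * sigma)  # compare with sigma**2
g_unif = g_xi(0.5, unif, b)             # compare with (b**2 - 0.5**2) / 2
```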

The following lemma can be viewed as an extension of Stein's identity (cf. [24]).

Lemma 1. Let $T_n(x, Y)$ be an estimator of $f(x)$ such that the mapping $Y \mapsto
T_n(Y) = (T_n(x_1, Y), \dots, T_n(x_n, Y))^T$ is continuously differentiable and let us
denote by $\partial_j T_n(x_i, Y)$ the partial derivative of the function $Y \mapsto T_n(x_i, Y)$ with respect
to the $j$th coordinate of $Y$. If Assumption (A) and the following condition

(5)    $\int_{\mathbb{R}} |y| \int_0^{y} |\partial_i T_n(x_i, f + z)|\, dz_i\, dF_\xi(y) < \infty, \quad i = 1, \dots, n,$
       or
       $\partial_i T_n(x_i, Y) \ge 0, \quad \forall Y \in \mathbb{R}^n, \ i = 1, \dots, n,$

are satisfied where $z = (z_1, \dots, z_n)^T$, $f = (f(x_1), \dots, f(x_n))^T$ then

       $E[\hat r_n(Y)] = E(\|T_n(Y) - f\|_n^2),$

where

       $\hat r_n(Y) = \|T_n(Y) - Y\|_n^2 + \frac{2}{n} \sum_{i=1}^n \partial_i T_n(x_i, Y)\, g_\xi(\xi_i) - \sigma^2.$
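Proof. We have

(6)    $E(\|T_n(Y) - f\|_n^2) = E\Big[\|T_n(Y) - Y\|_n^2 + \frac{2}{n} \sum_{i=1}^n \xi_i (T_n(x_i, Y) - f(x_i))\Big] - \sigma^2$
                            $= E\Big[\|T_n(Y) - Y\|_n^2 + \frac{2}{n} \sum_{i=1}^n \xi_i T_n(x_i, Y)\Big] - \sigma^2.$

For $z = (z_1, \dots, z_n)^T \in \mathbb{R}^n$ write $F_{\xi,i}(z) = \prod_{j \ne i} F_\xi(z_j)$. Since $E(\xi_i) = 0$ we have

(7)    $E[\xi_i T_n(x_i, Y)] = E\Big[\xi_i \int_0^{\xi_i} \partial_i T_n(x_i, Y_1, \dots, Y_{i-1}, f(x_i) + z, Y_{i+1}, \dots, Y_n)\, dz\Big]$
                           $= \int_{\mathbb{R}^{n-1}} \Big[\int_{\mathbb{R}} y \int_0^{y} \partial_i T_n(x_i, f + z)\, dz_i\, dF_\xi(y)\Big]\, dF_{\xi,i}(z).$

Condition (5) allows us to apply the Fubini theorem to the expression in square
brackets on the right hand side of the last display. Thus, using the definition of
$m_\xi$ and Assumption (A) we find

       $\int_{\mathbb{R}_+} y \int_0^{y} \partial_i T_n(x_i, f + z)\, dz_i\, dF_\xi(y) = \int_{\mathbb{R}_+} \int_{z_i}^{\infty} y\, dF_\xi(y)\, \partial_i T_n(x_i, f + z)\, dz_i$
                           $= \int_{\mathbb{R}_+} m_\xi(z_i)\, \partial_i T_n(x_i, f + z)\, dz_i$
                           $= \int_{\mathbb{R}_+} g_\xi(z_i)\, \partial_i T_n(x_i, f + z)\, dF_\xi(z_i).$

A similar equality holds for the integral over $\mathbb{R}_-$. Thus, in view of (7), we obtain

       $E[\xi_i T_n(x_i, Y)] = E[\partial_i T_n(x_i, Y)\, g_\xi(\xi_i)].$

Combining the last display with (6) we get the lemma. □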
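
The identity of Lemma 1 can be illustrated by simulation. The sketch below is not from the paper: it assumes uniform errors on $[-b, b]$, for which example (ii) gives $g_\xi(x) = (b^2 - x^2)_+/2$, and a hypothetical coordinatewise estimator $T_n(x_i, Y) = \tanh(Y_i)$, whose partial derivatives are non-negative so that the second alternative in condition (5) holds. The sample sizes and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, reps = 5, 1.0, 200_000
f = np.linspace(-1.0, 1.0, n)        # assumed true values f(x_1), ..., f(x_n)
sigma2 = b**2 / 3.0                  # variance of the uniform law on [-b, b]

xi = rng.uniform(-b, b, size=(reps, n))
Y = f + xi
T = np.tanh(Y)                       # hypothetical estimator T_n(x_i, Y) = tanh(Y_i)
dT = 1.0 - T**2                      # partial derivative of tanh(Y_i) w.r.t. Y_i
g = (b**2 - xi**2) / 2.0             # g_xi(xi_i) for uniform errors, example (ii)

# Unbiased risk estimate: ||T - Y||_n^2 + (2/n) sum_i dT_i * g(xi_i) - sigma^2
r_hat = np.mean((T - Y) ** 2, axis=1) + 2.0 * np.mean(dT * g, axis=1) - sigma2
loss = np.mean((T - f) ** 2, axis=1)  # ||T_n(Y) - f||_n^2

gap = r_hat.mean() - loss.mean()      # should be close to zero up to Monte Carlo error
```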

Based on Lemma 1 we obtain the following bound on the risk of the exponentially
weighted aggregate $\hat f_n$.

Theorem 1. Let $\pi$ be an element of $\mathcal{P}_\Lambda$ such that, for all $Y' \in \mathbb{R}^n$ and $\beta > 0$, the
mappings $\lambda \mapsto \theta_\lambda(Y') f_\lambda^2(x_i)$, $i = 1, \dots, n$, are $\pi$-integrable. If Assumption (A) is
fulfilled then the aggregate $\hat f_n$ defined by (2) with $\beta \ge 4\|g_\xi\|_\infty$ satisfies the inequality

(8)    $E(\|\hat f_n - f\|_n^2) \le \int \|f_\lambda - f\|_n^2\, p(d\lambda) + \dfrac{\beta\, \mathcal{K}(p, \pi)}{n}, \quad \forall p \in \mathcal{P}_\Lambda,$

where $\mathcal{K}(p, \pi)$ stands for the Kullback-Leibler divergence between $p$ and $\pi$.

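Proof. We will now use Lemma 1 with $T_n = \hat f_n$. Accordingly, we write here
$\hat f_n(x_i, Y)$ instead of $\hat f_n(x_i)$. Applying the dominated convergence theorem and
taking into account the definition of $\theta_\lambda(Y)$ we easily find that the $\pi$-integrability
of $\lambda \mapsto \theta_\lambda(Y') f_\lambda^2(x_i)$ for all $i$, $Y'$ implies that the mapping $Y \mapsto \hat f_n(Y) =
(\hat f_n(x_1, Y), \dots, \hat f_n(x_n, Y))^T$ is continuously differentiable. Simple algebra yields

       $\partial_i \hat f_n(x_i, Y) = \frac{2}{\beta} \Big\{\int_\Lambda f_\lambda^2(x_i)\, \theta_\lambda(Y)\, \pi(d\lambda) - \hat f_n^2(x_i, Y)\Big\}$
                          $= \frac{2}{\beta} \int_\Lambda (f_\lambda(x_i) - \hat f_n(x_i, Y))^2\, \theta_\lambda(Y)\, \pi(d\lambda) \ge 0.$

Therefore, (5) is fulfilled for $T_n = \hat f_n$ and we can apply Lemma 1 which yields

       $E[\hat r_n(Y)] = E(\|\hat f_n(Y) - f\|_n^2)$

with

       $\hat r_n(Y) = \|\hat f_n(Y) - Y\|_n^2 + \frac{2}{n} \sum_{i=1}^n \partial_i \hat f_n(x_i, Y)\, g_\xi(\xi_i) - \sigma^2.$

Since $\hat f_n(Y)$ is the expectation of $f_\lambda$ w.r.t. the probability measure $\theta \cdot \pi$,

       $\|\hat f_n(Y) - Y\|_n^2 = \int_\Lambda \{\|f_\lambda - Y\|_n^2 - \|f_\lambda - \hat f_n(Y)\|_n^2\}\, \theta_\lambda(Y)\, \pi(d\lambda).$

Combining these results we get

       $\hat r_n(Y) = \int_\Lambda \Big\{\|f_\lambda - Y\|_n^2 - \frac{1}{n}\sum_{i=1}^n \Big(1 - \frac{4 g_\xi(\xi_i)}{\beta}\Big)(f_\lambda(x_i) - \hat f_n(x_i, Y))^2\Big\}\, \theta_\lambda(Y)\, \pi(d\lambda) - \sigma^2$
                 $\le \int_\Lambda \|f_\lambda - Y\|_n^2\, \theta_\lambda(Y)\, \pi(d\lambda) - \sigma^2,$

where we used that $\beta \ge 4\|g_\xi\|_\infty$. By definition of $\theta_\lambda$,

       $-n\|f_\lambda - Y\|_n^2 = \beta \log \theta_\lambda(Y) + \beta \log\Big[\int_\Lambda e^{-n\|Y - f_w\|_n^2/\beta}\, \pi(dw)\Big].$

Integrating this equation over $\theta \cdot \pi$, using the fact that $\int_\Lambda \theta_\lambda(Y) \log \theta_\lambda(Y)\, \pi(d\lambda) =
\mathcal{K}(\theta \cdot \pi, \pi) \ge 0$ and a convex duality argument (cf., e.g., [15], p. 264, or [10], p. 160)
we get

       $\hat r_n(Y) \le -\frac{\beta}{n} \log\Big[\int_\Lambda e^{-n\|Y - f_w\|_n^2/\beta}\, \pi(dw)\Big] - \sigma^2$
                 $\le \int_\Lambda \|Y - f_w\|_n^2\, p(dw) + \dfrac{\beta\, \mathcal{K}(p, \pi)}{n} - \sigma^2$

for all $p \in \mathcal{P}_\Lambda$. Taking expectations in the last inequality we obtain (8). □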
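
The convex duality step at the end of the proof is the variational inequality $-\frac{\beta}{n}\log \int e^{-n r_w/\beta}\,\pi(dw) \le \int r_w\, p(dw) + \beta \mathcal{K}(p,\pi)/n$, with equality when $p$ is the Gibbs measure. For a finite $\Lambda$ it can be checked directly, as in the following sketch; the function names `log_partition_bound`, `kl` and `gibbs` are hypothetical and not part of the paper.

```python
import numpy as np

def log_partition_bound(risks, prior, beta, n):
    """-(beta/n) * log sum_j prior_j * exp(-n * risks_j / beta), computed stably."""
    a = np.log(prior) - n * risks / beta
    m = a.max()
    return -(beta / n) * (m + np.log(np.exp(a - m).sum()))

def kl(p, prior):
    """Kullback-Leibler divergence K(p, prior) for discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / prior[mask])))

def gibbs(risks, prior, beta, n):
    """Gibbs measure with weights proportional to prior_j * exp(-n * risks_j / beta)."""
    a = np.log(prior) - n * risks / beta
    w = np.exp(a - a.max())
    return w / w.sum()
```

For any discrete `p`, the quantity `p @ risks + beta * kl(p, prior) / n` is at least `log_partition_bound(risks, prior, beta, n)`, and the two coincide for `p = gibbs(risks, prior, beta, n)`.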



4. Risk bounds for n-divisible distributions of errors



In this section we present a second approach to prove sharp risk bounds of the
form (8). The main idea of the proof consists in an artificial introduction of a
"dummy" random vector $\zeta \in \mathbb{R}^n$ independent of $\xi = (\xi_1, \dots, \xi_n)$ and having the
same type of distribution as $\xi$. This approach will allow us to cover the class of
distributions of $\xi_i$ satisfying the following assumption.

(B) There exist i.i.d. random variables $\zeta_1, \dots, \zeta_n$ defined on an enlargement of
the probability space $(\Omega, \mathcal{F}, P)$ such that:

(B1) the random variable $\xi_1 + \zeta_1$ has the same distribution as $(1 + 1/n)\xi_1$,
(B2) the vectors $\zeta = (\zeta_1, \dots, \zeta_n)$ and $\xi = (\xi_1, \dots, \xi_n)$ are independent.
If $\xi_1$ satisfies (B1), then we will say that its distribution is n-divisible.

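We will need one more assumption. Let $L_\zeta : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ be the moment
generating function of the random variable $\zeta_1$, i.e., $L_\zeta(t) = E(e^{t\zeta_1})$, $t \in \mathbb{R}$.

(C) There exist a functional $\Psi_\beta : \mathcal{P}'_\Lambda \times \mathcal{P}'_\Lambda \to \mathbb{R}$ and a real number $\beta_0 > 0$
such that

(9)    $e^{(\|f - \bar f_{\mu'}\|_n^2 - \|f - \bar f_\mu\|_n^2)/\beta} \prod_{i=1}^n L_\zeta\Big(\dfrac{2(\bar f_\mu(x_i) - \bar f_{\mu'}(x_i))}{\beta}\Big) \le \Psi_\beta(\mu, \mu'), \quad \forall \mu, \mu' \in \mathcal{P}'_\Lambda,$
       $\mu \mapsto \Psi_\beta(\mu, \mu')$ is concave and continuous in the total
       variation norm for any $\mu' \in \mathcal{P}'_\Lambda$,
       $\Psi_\beta(\mu, \mu) = 1,$

for any $\beta \ge \beta_0$.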

We now discuss some sufficient conditions for assumptions (B) and (C). Denote by
$\mathcal{D}_n$ the set of all probability distributions of $\xi_1$ satisfying assumption (B1). First, it
is easy to see that all the zero-mean Gaussian or double exponential distributions
belong to $\mathcal{D}_n$. Furthermore, $\mathcal{D}_n$ contains all the stable distributions. However, since
the non-Gaussian stable distributions do not have second order moments, they do
not satisfy (C). One can also check that the convolution of two distributions from
$\mathcal{D}_n$ belongs to $\mathcal{D}_n$. Finally, note that the intersection $\mathcal{D} = \cap_{n \ge 1} \mathcal{D}_n$ is included
in the set of all infinitely divisible distributions and is called the L-class (see [29],
Theorem 3.6, p. 102).

However, some basic distributions such as the uniform or the Bernoulli
distribution do not belong to $\mathcal{D}_n$. To show this, let us recall that the characteristic
function of the uniform distribution on $[-a, a]$ is given by $\varphi(t) = \sin(at)/(at)$. For
this function, $\varphi((n + 1)t)/\varphi(nt)$ is equal to infinity at the points where $\sin(nat)$
vanishes (unless $n = 1$). Therefore, it cannot be a characteristic function. A similar
argument shows that the centered Bernoulli and centered binomial distributions do
not belong to $\mathcal{D}_n$.

Assumption (C) can be readily checked when the moment generating function
$L_\zeta(t)$ is locally sub-Gaussian, i.e., there exists a constant $c > 0$ such that the
inequality $L_\zeta(t) \le e^{ct^2}$ holds for sufficiently small values of $t$. Examples include
all the zero-mean distributions with bounded support, the Gaussian and double- 
exponential distributions, etc. The validity of Assumption (C) for such distributions 
follows from Lemma 0] in the Appendix. 

Theorem 2. Let $\pi$ be an element of $\mathcal{P}_\Lambda$ such that, for all $Y' \in \mathbb{R}^n$ and $\beta > 0$,
the mappings $\lambda \mapsto \theta_\lambda(Y') f_\lambda(x_i)$, $i = 1, \dots, n$, are $\pi$-integrable. If assumptions (B)
and (C) are fulfilled, then the aggregate $\hat f_n$ defined by (2) with $\beta \ge \beta_0$ satisfies the
inequality

(10)   $E(\|\hat f_n - f\|_n^2) \le \int \|f_\lambda - f\|_n^2\, p(d\lambda) + \dfrac{\beta\, \mathcal{K}(p, \pi)}{n + 1}, \quad \forall p \in \mathcal{P}_\Lambda.$

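Proof. Define the mapping $H : \mathcal{P}'_\Lambda \to \mathbb{R}^n$ by

       $H_\mu = (\bar f_\mu(x_1) - f(x_1), \dots, \bar f_\mu(x_n) - f(x_n))^T, \quad \mu \in \mathcal{P}'_\Lambda.$

For brevity, we will write

       $h_\lambda = H_{\delta_\lambda} = (f_\lambda(x_1) - f(x_1), \dots, f_\lambda(x_n) - f(x_n))^T, \quad \lambda \in \Lambda,$

where $\delta_\lambda$ is the Dirac measure at $\lambda$ (that is $\delta_\lambda(A) = 1(\lambda \in A)$ for any $A \in \mathcal{A}$).

Since $E(\xi_i) = 0$, assumption (B1) implies that $E(\zeta_i) = 0$ for $i = 1, \dots, n$. On
the other hand, (B2) implies that $\zeta$ is independent of $\theta_\lambda$. Therefore, we have

(11)   $E(\|\bar f_{\theta\cdot\pi} - f\|_n^2) = \beta E \log \exp\Big\{\dfrac{\|\bar f_{\theta\cdot\pi} - f\|_n^2 - 2\zeta^T H_{\theta\cdot\pi}}{\beta}\Big\} = S + S_1$

where

       $S = -\beta E \log \int_\Lambda \theta_\lambda \exp\Big\{-\dfrac{\|f_\lambda - f\|_n^2 - 2\zeta^T h_\lambda}{\beta}\Big\}\, \pi(d\lambda),$
       $S_1 = \beta E \log \int_\Lambda \theta_\lambda \exp\Big\{\dfrac{\|\bar f_{\theta\cdot\pi} - f\|_n^2 - \|f_\lambda - f\|_n^2 + 2\zeta^T (h_\lambda - H_{\theta\cdot\pi})}{\beta}\Big\}\, \pi(d\lambda).$

The definition of $\theta_\lambda$ yields

       $S = -\beta E \log \int_\Lambda \exp\Big\{-\dfrac{n\|Y - f_\lambda\|_n^2 + \|f_\lambda - f\|_n^2 - 2\zeta^T h_\lambda}{\beta}\Big\}\, \pi(d\lambda)$
(12)       $+ \beta E \log \int_\Lambda \exp\Big\{-\dfrac{n\|Y - f_\lambda\|_n^2}{\beta}\Big\}\, \pi(d\lambda).$

Since $\|Y - f_\lambda\|_n^2 = \|\xi\|_n^2 - 2n^{-1}\xi^T h_\lambda + \|f_\lambda - f\|_n^2$, we get

       $S = -\beta E \log \int_\Lambda \exp\Big\{-\dfrac{(n + 1)\|f_\lambda - f\|_n^2 - 2(\xi + \zeta)^T h_\lambda}{\beta}\Big\}\, \pi(d\lambda)$
           $+ \beta E \log \int_\Lambda \exp\Big\{-\dfrac{n\|f - f_\lambda\|_n^2 - 2\xi^T h_\lambda}{\beta}\Big\}\, \pi(d\lambda)$
(13)     $= \beta E \log \int_\Lambda e^{-n\rho(\lambda)}\, \pi(d\lambda) - \beta E \log \int_\Lambda e^{-(n+1)\rho(\lambda)}\, \pi(d\lambda),$

where we used the notation $\rho(\lambda) = (\|f - f_\lambda\|_n^2 - 2n^{-1}\xi^T h_\lambda)/\beta$ and the fact that
$\xi + \zeta$ can be replaced by $(1 + 1/n)\xi$ inside the expectation. The Hölder inequality
implies that $\int_\Lambda e^{-n\rho(\lambda)}\, \pi(d\lambda) \le \big(\int_\Lambda e^{-(n+1)\rho(\lambda)}\, \pi(d\lambda)\big)^{n/(n+1)}$. Therefore,

(14)   $S \le -\dfrac{\beta}{n + 1}\, E \log \int_\Lambda e^{-(n+1)\rho(\lambda)}\, \pi(d\lambda).$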

Assume now that $p \in \mathcal{P}_\Lambda$ is absolutely continuous with respect to $\pi$. Denote by
$\phi$ the corresponding Radon-Nikodym derivative and by $\Lambda_+$ the support of $p$. Using
the concavity of the logarithm and Jensen's inequality we get

       $-E \log \int_\Lambda e^{-(n+1)\rho(\lambda)}\, \pi(d\lambda) \le -E \log \int_{\Lambda_+} e^{-(n+1)\rho(\lambda)}\, \pi(d\lambda)$
              $= -E \log \int_{\Lambda_+} e^{-(n+1)\rho(\lambda)}\, \phi^{-1}(\lambda)\, p(d\lambda)$
              $\le (n + 1)\, E \int_{\Lambda_+} \rho(\lambda)\, p(d\lambda) + \int_{\Lambda_+} \log \phi(\lambda)\, p(d\lambda).$

Noticing that the last integral here equals $\mathcal{K}(p, \pi)$ and combining the resulting
inequality with (14) we obtain

       $S \le \beta E \int_\Lambda \rho(\lambda)\, p(d\lambda) + \dfrac{\beta\, \mathcal{K}(p, \pi)}{n + 1}.$

Since $E(\xi_i) = 0$ for every $i = 1, \dots, n$, we have $\beta E(\rho(\lambda)) = \|f_\lambda - f\|_n^2$, and using
the Fubini theorem we find

(15)   $S \le \int_\Lambda \|f_\lambda - f\|_n^2\, p(d\lambda) + \dfrac{\beta\, \mathcal{K}(p, \pi)}{n + 1}.$

Note that this inequality also holds in the case where $p$ is not absolutely continuous
with respect to $\pi$, since in this case $\mathcal{K}(p, \pi) = \infty$.

To complete the proof, it remains to show that $S_1 \le 0$. Let $E_\xi(\cdot)$ denote the
conditional expectation $E(\cdot\,|\,\xi)$. By the concavity of the logarithm,

       $S_1 \le \beta E \log \int_\Lambda \theta_\lambda\, E_\xi \exp\Big\{\dfrac{\|\bar f_{\theta\cdot\pi} - f\|_n^2 - \|f_\lambda - f\|_n^2 + 2\zeta^T (h_\lambda - H_{\theta\cdot\pi})}{\beta}\Big\}\, \pi(d\lambda).$

Since $\hat f_n = \bar f_{\theta\cdot\pi}$ and $\zeta$ is independent of $\theta_\lambda$, the last expectation on the right hand
side of this inequality is bounded from above by $\Psi_\beta(\delta_\lambda, \theta \cdot \pi)$. Now, the fact that
$S_1 \le 0$ follows from the concavity and continuity of the functional $\Psi_\beta(\cdot, \theta \cdot \pi)$,
Jensen's inequality and the equality $\Psi_\beta(\theta \cdot \pi, \theta \cdot \pi) = 1$. □
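
The Hölder step used to pass from (13) to (14) states that, for a probability measure $\pi$, $\int e^{-n\rho}\, d\pi \le (\int e^{-(n+1)\rho}\, d\pi)^{n/(n+1)}$. The following sketch checks it for a discrete $\pi$; the function name `holder_gap` is hypothetical and the values of `rho` in any usage are arbitrary test inputs, not quantities from the paper.

```python
import numpy as np

def holder_gap(rho, prior, n):
    """Right side minus left side of the Hoelder step; non-negative for any rho."""
    lhs = np.sum(prior * np.exp(-n * rho))
    rhs = np.sum(prior * np.exp(-(n + 1) * rho)) ** (n / (n + 1))
    return rhs - lhs
```

The gap vanishes when `rho` is constant on the support of the prior, which is the equality case of the Hölder inequality.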

Another way to read the results of Theorems 1 and 2 is that, if the "phantom"
Gaussian error model (4) with variance taken larger than a certain threshold value
is used to construct the Bayesian posterior mean $\hat f_n$, then $\hat f_n$ is close on the average
to the best prediction under the true model, even when the true data generating
distribution is non-Gaussian.

We now illustrate the application of Theorem 2 by an example. Assume that the
errors $\xi_i$ are double exponential, that is the distribution of $\xi_i$ admits a density with
respect to the Lebesgue measure given by

       $f_\xi(x) = \dfrac{1}{\sqrt{2\sigma^2}}\, e^{-\sqrt{2}|x|/\sigma}, \quad x \in \mathbb{R}.$

Aggregation under this assumption is discussed in [39] where it is recommended to
modify the weights (3) matching them to the shape of $f_\xi$. For such a procedure
[39] proves an oracle inequality with leading constant which is greater than 1. The
next proposition shows that sharp risk bounds (i.e., with leading constant 1) can
be obtained without modifying the weights (3).

Proposition 1. Assume that $\sup_{\lambda \in \Lambda} \|f - f_\lambda\|_n \le L < \infty$ and $\sup_{i, \lambda} |f_\lambda(x_i)| \le \bar L <
\infty$. Let the random variables $\xi_i$ be i.i.d. double exponential with variance $\sigma^2 > 0$.
Then for any $\beta$ larger than

       $\max\Big(\Big(8 + \frac{4}{n}\Big)\sigma^2 + 2L^2, \ 4\sigma\Big(1 + \frac{1}{n}\Big)\bar L\Big)$

the aggregate $\hat f_n$ satisfies inequality (10).

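Proof. We apply Theorem 2. The characteristic function of the double exponential
density is $\varphi(t) = 2/(2 + \sigma^2 t^2)$. Solving $\varphi(t)\varphi_\zeta(t) = \varphi((n + 1)t/n)$ we get the
characteristic function $\varphi_\zeta$ of $\zeta_1$. The corresponding Laplace transform $L_\zeta$ in this
case is $L_\zeta(t) = \varphi_\zeta(-it)$, which yields

       $L_\zeta(t) = 1 + \dfrac{(2n + 1)\sigma^2 t^2}{2n^2 - (n + 1)^2 \sigma^2 t^2}.$

Therefore

       $\log L_\zeta(t) \le (2n + 1)(\sigma t/n)^2, \quad |t| \le \dfrac{n}{(n + 1)\sigma}.$

We now use this inequality to check assumption (C). Let $\beta$ be larger than
$4\sigma(1 + 1/n)\bar L$. Then for all $\mu, \mu' \in \mathcal{P}'_\Lambda$ we have

       $\dfrac{2|\bar f_\mu(x_i) - \bar f_{\mu'}(x_i)|}{\beta} \le \dfrac{4\bar L}{\beta} \le \dfrac{n}{(n + 1)\sigma}$

and consequently

       $\log L_\zeta\Big(\dfrac{2|\bar f_\mu(x_i) - \bar f_{\mu'}(x_i)|}{\beta}\Big) \le \dfrac{4\sigma^2 (2n + 1)(\bar f_\mu(x_i) - \bar f_{\mu'}(x_i))^2}{n^2 \beta^2}.$

This implies that

       $\exp\Big(\dfrac{\|f - \bar f_{\mu'}\|_n^2 - \|f - \bar f_\mu\|_n^2}{\beta}\Big) \prod_{i=1}^n L_\zeta\Big(\dfrac{2(\bar f_\mu(x_i) - \bar f_{\mu'}(x_i))}{\beta}\Big) \le \Psi_\beta(\mu, \mu'),$

where

       $\Psi_\beta(\mu, \mu') = \exp\Big(\dfrac{\|f - \bar f_{\mu'}\|_n^2 - \|f - \bar f_\mu\|_n^2}{\beta} + \dfrac{4\sigma^2 (2n + 1)\|\bar f_\mu - \bar f_{\mu'}\|_n^2}{n\beta^2}\Big).$

This functional satisfies $\Psi_\beta(\mu, \mu) = 1$, and it is not hard to see that the mapping
$\mu \mapsto \Psi_\beta(\mu, \mu')$ is continuous in the total variation norm. Finally, this mapping is
concave for every $\beta \ge (8 + 4/n)\sigma^2 + 2\sup_\lambda \|f - f_\lambda\|_n^2$ by virtue of Lemma 4 in
the Appendix. Therefore, assumption (C) is fulfilled and the desired result follows
from Theorem 2. □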
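
The two formulas used in the proof, the expression of $L_\zeta(t)$ and the bound $\log L_\zeta(t) \le (2n+1)(\sigma t/n)^2$ on $|t| \le n/((n+1)\sigma)$, can be verified numerically. The sketch below is only an illustration; the function names `mgf_zeta` and `cf_laplace` are hypothetical, and the characteristic function is evaluated at the imaginary argument $-it$ using complex arithmetic.

```python
import numpy as np

def cf_laplace(t, sigma):
    """Characteristic function 2 / (2 + sigma^2 t^2) of the double exponential law."""
    return 2.0 / (2.0 + (sigma * t) ** 2)

def mgf_zeta(t, n, sigma):
    """L_zeta(t) = 1 + (2n+1) sigma^2 t^2 / (2 n^2 - (n+1)^2 sigma^2 t^2)."""
    s2t2 = (sigma * t) ** 2
    return 1.0 + (2 * n + 1) * s2t2 / (2 * n**2 - (n + 1) ** 2 * s2t2)

def log_mgf_bound(t, n, sigma):
    """Upper bound (2n+1) (sigma t / n)^2 on log L_zeta(t) for |t| <= n / ((n+1) sigma)."""
    return (2 * n + 1) * (sigma * t / n) ** 2
```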



An argument similar to that of Proposition 1 can be used to deduce from
Theorem 2 that if the random variables $\xi_i$ are i.i.d. Gaussian $\mathcal{N}(0, \sigma^2)$, then inequality
(10) holds for every $\beta \ge (4 + 2/n)\sigma^2 + 2L^2$ (cf. [14]). However, in this Gaussian
framework we can also apply Theorem 1, which gives a better result: essentially the
same inequality (the only difference is that the Kullback divergence is divided by
$n$ and not by $n + 1$) holds for $\beta \ge 4\sigma^2$, with no assumption on the function $f$.
5. Model selection with finite or countable $\Lambda$

Consider now the particular case where $\Lambda$ is countable. W.l.o.g. we suppose that
$\Lambda = \{1, 2, \dots\}$, $\{f_\lambda, \lambda \in \Lambda\} = \{f_j\}_{j=1}^\infty$ and we set $\pi_j = \pi(\lambda = j)$. As a corollary of
Theorems 1 and 2 we get the following sharp oracle inequalities for model selection type
aggregation.

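Theorem 3. Let either the assumptions of Theorem 1 or those of Theorem 2 be
satisfied and let $\Lambda$ be countable. Then for any $\beta \ge \beta_0$ the aggregate $\hat f_n$ satisfies the
inequality

       $E(\|\hat f_n - f\|_n^2) \le \inf_{j \ge 1} \Big(\|f_j - f\|_n^2 + \dfrac{\beta \log \pi_j^{-1}}{n}\Big)$

where $\beta_0 = 4\|g_\xi\|_\infty$ when Theorem 1 is applied. In particular, if $\pi_j = 1/M$,
$j = 1, \dots, M$, we have, for any $\beta \ge \beta_0$,

(16)   $E(\|\hat f_n - f\|_n^2) \le \min_{j = 1, \dots, M} \|f_j - f\|_n^2 + \dfrac{\beta \log M}{n}.$

Proof. For a fixed integer $j_0 \ge 1$ we apply Theorems 1 or 2 with $p$ being the Dirac
measure: $p(\lambda = j) = 1(j = j_0)$, $j \ge 1$. Since $\mathcal{K}(p, \pi) = \log \pi_{j_0}^{-1}$ for this $p$, this gives

       $E(\|\hat f_n - f\|_n^2) \le \|f_{j_0} - f\|_n^2 + \dfrac{\beta \log \pi_{j_0}^{-1}}{n}.$

Since this inequality holds for every $j_0$, we obtain the first inequality of the
theorem. The second inequality is an obvious consequence of the first one. □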

Theorem 3 generalizes the result of [55] where the case of finite $\Lambda$ and Gaussian
errors $\xi_i$ is treated. For this case it is known that the rate of convergence $(\log M)/n$
in (16) cannot be improved [3 1 1 [6]. Furthermore, for the examples (i) - (iii) of
Section 3 (Gaussian or bounded errors) and finite $\Lambda$, inequality (16) is valid with
no assumption on $f$ and $f_\lambda$. Indeed, when $\Lambda$ is finite the integrability conditions are
automatically satisfied. Note that, for bounded errors $\xi_i$, oracle inequalities of the
form (16) are also established in the theory of prediction of deterministic sequences
[35l [26l [Til 122 H3] . However, those results require uniform boundedness not only
of the errors $\xi_i$ but also of the functions $f$ and $f_\lambda$. What is more, the minimal
allowed values of $\beta$ in those works depend on an upper bound on $f$ and $f_\lambda$ which
is not always available. The version of (16) based on Theorem 1 is free of such a
dependence.

6. Risk bounds for general distributions of errors 

As discussed above, assumption (B) restricts the application of Theorem 2 to
models with n-divisible errors. We now show that this limitation can be dropped.
The main idea of the proof of Theorem 2 was to introduce a dummy random vector
$\zeta$ independent of $\xi$. However, the independence property is stronger than what
we really need in the proof of Theorem 2. Below we come to a weaker condition
invoking a version of the Skorokhod embedding (a detailed survey on this subject
can be found in [28]).

For simplicity we assume that the errors $\xi_i$ are symmetric, i.e., $P(\xi_i > a) =
P(\xi_i < -a)$ for all $a \in \mathbb{R}$. The argument can be adapted to the asymmetric case as
well, but we do not discuss it here.
First, we describe a version of Skorokhod's construction that will be used below,
cf. [30, Proposition II.3.8].

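Lemma 2. Let $\xi_1, \dots, \xi_n$ be i.i.d. symmetric random variables on $(\Omega, \mathcal{F}, P)$. Then
there exist i.i.d. random variables $\zeta_1, \dots, \zeta_n$ defined on an enlargement of the
probability space $(\Omega, \mathcal{F}, P)$ such that

(a) $\xi + \zeta$ has the same distribution as $(1 + 1/n)\xi$,

(b) $E(\zeta_i\,|\,\xi_i) = 0$, $i = 1, \dots, n$,

(c) for any $\lambda > 0$ and for any $i = 1, \dots, n$, we have

       $E(e^{\lambda \zeta_i}\,|\,\xi_i) \le e^{(\lambda \xi_i)^2 (n + 1)/n^2}.$

Proof. Define $\zeta_i$ as a random variable such that, given $\xi_i$, it takes values $\xi_i/n$ or
$-2\xi_i - \xi_i/n$ with conditional probabilities $P(\zeta_i = \xi_i/n\,|\,\xi_i) = (2n + 1)/(2n + 2)$ and
$P(\zeta_i = -2\xi_i - \xi_i/n\,|\,\xi_i) = 1/(2n + 2)$. Then properties (a) and (b) are
straightforward. Property (c) follows from the relation

       $E(e^{\lambda \zeta_i}\,|\,\xi_i) = e^{\lambda \xi_i/n}\Big(1 + \dfrac{1}{2n + 2}\big(e^{-2\lambda \xi_i (1 + 1/n)} - 1\big)\Big)$

and Lemma 3 in the Appendix with $x = \lambda \xi_i/n$ and $\alpha_0 = 2n + 2$. □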
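
Properties (a) and (b) of the two-point construction in the proof can be seen directly in code. The sketch below is illustrative and not from the paper; the names `sample_zeta` and `cond_mean_zeta` are hypothetical. It relies on the observation that $\xi_i + \zeta_i$ equals $\pm(1 + 1/n)\xi_i$, so that for symmetric $\xi_i$ the sum has the law of $(1 + 1/n)\xi_i$.

```python
import numpy as np

def sample_zeta(xi, n, rng):
    """Given xi, draw zeta = xi/n w.p. (2n+1)/(2n+2), otherwise zeta = -2*xi - xi/n."""
    flip = rng.random(xi.shape) < 1.0 / (2 * n + 2)
    return np.where(flip, -2.0 * xi - xi / n, xi / n)

def cond_mean_zeta(xi, n):
    """Exact conditional mean E(zeta | xi) under the two-point conditional law."""
    p = 1.0 / (2 * n + 2)
    return (1.0 - p) * xi / n + p * (-2.0 * xi - xi / n)
```

In a simulation, `np.abs(xi + sample_zeta(xi, n, rng))` coincides with `(1 + 1/n) * np.abs(xi)`, and `cond_mean_zeta` returns zero up to rounding.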

We now state the main result of this section. 

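Theorem 4. Fix some $a > 0$ and assume that $\sup_{\lambda \in \Lambda} \|f - f_\lambda\|_n \le L$ for a finite
constant $L$. If the errors $\xi_i$ are symmetric and have a finite second moment $E(\xi_i^2)$,
then for any $\beta \ge 4(1 + 1/n)a + 2L^2$ we have

(17)   $E(\|\hat f_n - f\|_n^2) \le \int \|f_\lambda - f\|_n^2\, p(d\lambda) + \dfrac{\beta\, \mathcal{K}(p, \pi)}{n + 1} + R_n, \quad \forall p \in \mathcal{P}_\Lambda,$

where the residual term $R_n$ is given by

       $R_n = E^*\Big(\sup_{\lambda \in \Lambda} \sum_{i=1}^n \dfrac{4(n + 1)(\xi_i^2 - a)(f_\lambda(x_i) - \bar f_{\theta\cdot\pi}(x_i))^2}{n^2 \beta}\Big)$

and $E^*$ denotes the expectation with respect to the outer probability.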

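Proof. We slightly modify the proof of Theorem 2. We now consider a dummy
random vector $\zeta = (\zeta_1, \dots, \zeta_n)$ as in Lemma 2. Note that for this $\zeta$
relation (11) remains valid: in fact, it suffices to condition on $\xi$, to use
Lemma 2(b) and the fact that $\theta_\lambda$ is measurable with respect to $\xi$. Therefore,
with the notation of the proof of Theorem 2, we have
$E(\|\hat f_n - f\|_n^2) = S + S_1$. Using Lemma 2(a) and acting exactly as in the proof
of Theorem 2 we get that $S$ is bounded as in (15). Finally, as shown in the proof
of Theorem 2, the term $S_1$ satisfies

$$S_1 \le \beta E \log \int_\Lambda \theta_\lambda\, E\Big(\exp\Big\{-\frac{\|f - f_\lambda\|_n^2 - \|f - \hat f_n\|_n^2}{\beta} + \frac{2}{\beta}\sum_{i=1}^n \big(f_\lambda(x_i) - \hat f_n(x_i)\big)\zeta_i\Big\} \,\Big|\, \xi\Big)\, \pi(d\lambda).$$

According to Lemma 2(c),

$$E\Big(\exp\Big\{\frac{2\big(f_\lambda(x_i) - \hat f_n(x_i)\big)\zeta_i}{\beta}\Big\} \,\Big|\, \xi_i\Big) \le \exp\Big\{\frac{4(n+1)\big(f_\lambda(x_i) - \hat f_n(x_i)\big)^2 \xi_i^2}{n^2\beta^2}\Big\}.$$

Therefore, $S_1 \le S_2 + R_n$, where

$$S_2 = \beta E \log \int_\Lambda \theta_\lambda \exp\Big(\frac{4\alpha(n+1)\|f_\lambda - \hat f_n\|_n^2}{n\beta^2} - \frac{\|f - f_\lambda\|_n^2 - \|f - \hat f_n\|_n^2}{\beta}\Big)\, \pi(d\lambda).$$

Finally, we apply Lemma 4 (cf. Appendix) with $s^2 = 4\alpha(n+1)$ and Jensen's
inequality to get that $S_2 \le 0$. □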

In view of Theorem 4, to get the bound (10) it suffices to show that the remainder
term $R_n$ is non-positive under some assumptions on the errors $\xi_i$. More generally,
we may derive somewhat less accurate inequalities than (10) by proving that $R_n$ is
small enough. This is illustrated by the following corollaries.

Corollary 1. Let the assumptions of Theorem 4 be satisfied and let $|\xi_i| \le B$ almost
surely where $B$ is a finite constant. Then the aggregate $\hat f_n$ satisfies inequality (10)
for any $\beta \ge 4B^2(1 + 1/n) + 2L^2$.

Proof. It suffices to note that for $\alpha = B^2$ we get $R_n \le 0$. □

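Corollary 2. Let the assumptions of Theorem 4 be satisfied and suppose that
$E(e^{t|\xi_i|^\kappa}) \le B$ for some constants $t > 0$, $\kappa > 0$, $B > 0$. Then for any
$n \ge e^{1/\kappa}$ and any $\beta \ge 4(1 + 1/n)(2(\log n)/t)^{2/\kappa} + 2L^2$ we have

(18)  $$E\big(\|\hat f_n - f\|_n^2\big) \le \int_\Lambda \|f_\lambda - f\|_n^2\, p(d\lambda) + \frac{\beta\,\mathcal{K}(p,\pi)}{n+1} + \frac{16BL^2(n+1)(2\log n)^{2/\kappa}}{n^2\beta t^{2/\kappa}}, \qquad \forall p \in \mathcal{P}_\Lambda.$$

In particular, if $\Lambda = \{1, \dots, M\}$ and $\pi$ is the uniform measure on $\Lambda$ we get

(19)  $$E\big(\|\hat f_n - f\|_n^2\big) \le \min_{j=1,\dots,M} \|f_j - f\|_n^2 + \frac{\beta \log M}{n+1} + \frac{16BL^2(n+1)(2\log n)^{2/\kappa}}{n^2\beta t^{2/\kappa}}.$$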

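Proof. Set $\alpha = (2(\log n)/t)^{2/\kappa}$ and note that

(20)  $$R_n \le \frac{4(n+1)}{n\beta}\, \sup_{\lambda,\mu\in\Lambda} \|f_\lambda - f_\mu\|_n^2 \sum_{i=1}^n E(\xi_i^2 - \alpha)_+ \le \frac{16L^2(n+1)}{\beta}\, E(\xi_1^2 - \alpha)_+,$$

where $a_+ = \max(0, a)$. For any $x \ge (2/(t\kappa))^{1/\kappa}$ the function
$x \mapsto x^2 e^{-tx^\kappa}$ is decreasing. Therefore, for any $n \ge e^{1/\kappa}$ we have
$x^2 e^{-tx^\kappa} \le \alpha e^{-t\alpha^{\kappa/2}} = \alpha/n^2$, as soon as $x^2 \ge \alpha$.
Hence, $E(\xi_1^2 - \alpha)_+ \le B\alpha/n^2$ and the desired inequality follows. □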

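Corollary 3. Assume that $\sup_{\lambda\in\Lambda} \|f - f_\lambda\|_\infty \le L$ and the errors
$\xi_i$ are symmetric with $E(|\xi_i|^s) \le B$ for some constants $s \ge 2$, $B > 0$. Then for
any $\alpha_0 > 0$ and any $\beta \ge 4(1 + 1/n)\alpha_0 n^{2/(s+2)} + 2L^2$ we have

$$E\big(\|\hat f_n - f\|_n^2\big) \le \int_\Lambda \|f_\lambda - f\|_n^2\, p(d\lambda) + \frac{\beta\,\mathcal{K}(p,\pi)}{n+1} + C n^{-s/(s+2)}, \qquad \forall p \in \mathcal{P}_\Lambda,$$

where $C > 0$ is a constant that depends only on $s$, $L$, $B$ and $\alpha_0$.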

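Proof. Set $\alpha = \alpha_0 n^{2/(s+2)}$. In view of the inequality
$(f_\lambda(x_i) - \hat f_n(x_i))^2 \le 4\sup_{\lambda\in\Lambda} \|f - f_\lambda\|_\infty^2$,
the remainder term of Theorem 4 can be bounded as follows:

$$R_n \le \frac{16L^2(n+1)}{n^2\beta} \sum_{i=1}^n E(\xi_i^2 - \alpha)_+ \le \frac{4L^2}{n\alpha} \sum_{i=1}^n E(\xi_i^2 - \alpha)_+ .$$

To complete the proof, it suffices to notice that
$E(\xi_i^2 - \alpha)_+ \le E(\xi_i^2\, 1(\xi_i^2 > \alpha)) \le E(|\xi_i|^s)/\alpha^{s/2-1}$
by the Markov inequality. □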

Corollary 2 shows that if the tails of the distribution of errors have exponential
decay and if $\beta$ is of the order $(\log n)^{2/\kappa}$, then the rate of convergence in the
bound (19) is of the order $(\log n)^{2/\kappa}(\log M)/n$. The residual $R_n$ in Corollary 2
is of a smaller order than this rate and can be made smaller still by taking
$\alpha = (u(\log n)/t)^{2/\kappa}$ with $u > 2$. For $\kappa = 1$, comparing Corollary 2 with the
risk bounds obtained in [9, 20] for an averaged algorithm in i.i.d. random design
regression, we see that an extra $\log n$ multiplier appears. It is noteworthy that this
deterioration of the convergence rate does not occur if only the existence of finite
(power) moments is assumed. In this case, the result of Corollary 3 provides the same
rates of convergence as those obtained under the analogous moment conditions for
model selection type aggregation in the i.i.d. case, cf. [20, 2].
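The conditions on the temperature $\beta$ in Corollaries 1 to 3 are explicit, so they can be evaluated directly. The sketch below is only an illustration under the notation of the corollaries; the function names are hypothetical, and each function returns the smallest $\beta$ allowed by the corresponding statement.

```python
import math


def beta_bounded(B, L, n):
    """Smallest beta allowed by Corollary 1 when |xi_i| <= B almost surely."""
    return 4.0 * B**2 * (1.0 + 1.0 / n) + 2.0 * L**2


def beta_exponential_tail(t, kappa, L, n):
    """Smallest beta allowed by Corollary 2 (which needs n >= exp(1 / kappa))."""
    if n < math.exp(1.0 / kappa):
        raise ValueError("Corollary 2 requires n >= exp(1 / kappa)")
    alpha = (2.0 * math.log(n) / t) ** (2.0 / kappa)
    return 4.0 * (1.0 + 1.0 / n) * alpha + 2.0 * L**2


def beta_moment(alpha0, s, L, n):
    """Smallest beta allowed by Corollary 3 under a bounded s-th moment."""
    return 4.0 * (1.0 + 1.0 / n) * alpha0 * n ** (2.0 / (s + 2.0)) + 2.0 * L**2


def residual_exponential_tail(B, t, kappa, L, n, beta):
    """Last term on the right-hand side of (18)."""
    alpha = (2.0 * math.log(n) / t) ** (2.0 / kappa)
    return 16.0 * B * L**2 * (n + 1) * alpha / (n**2 * beta)
```

For instance, `beta_exponential_tail` grows like $(\log n)^{2/\kappa}$, while `beta_moment` grows like $n^{2/(s+2)}$, in line with the discussion of rates above.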

7. Sparsity oracle inequalities with no assumption on the dictionary 

In this section we assume that $f_\lambda$ is a linear combination of $M$ known functions
$\phi_1, \dots, \phi_M$, where $\phi_j : \mathcal{X} \to \mathbb{R}$, with the vector of weights
$\lambda = (\lambda_1, \dots, \lambda_M)$ that belongs to a subset $\Lambda$ of $\mathbb{R}^M$:

$$f_\lambda = \sum_{j=1}^M \lambda_j \phi_j .$$

The set of functions $\{\phi_1, \dots, \phi_M\}$ is called the dictionary.

Our aim is to obtain sparsity oracle inequalities (SOI) for the aggregate with
exponential weights $\hat f_n$. The SOI are oracle inequalities bounding the risk in terms
of the number $M(\lambda)$ of non-zero components (sparsity index) of $\lambda$ or similar
characteristics. As discussed in the Introduction, the SOI is a powerful tool allowing
one to solve simultaneously several problems: sparse recovery in high-dimensional
regression models, adaptive nonparametric regression estimation, linear, convex and
model selection type aggregation.

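For $\lambda \in \mathbb{R}^M$ denote by $J(\lambda)$ the set of indices $j$ such that
$\lambda_j \ne 0$, and set $M(\lambda) = \mathrm{Card}(J(\lambda))$. For any $\tau > 0$,
$0 < L_0 \le \infty$, define the probability densities

(21)  $$q_0(t) = \frac{3}{2(1 + |t|)^4}, \qquad \forall t \in \mathbb{R},$$

(22)  $$q(\lambda) = \frac{1}{C_0} \prod_{j=1}^M \tau^{-1} q_0(\lambda_j/\tau)\, 1(\|\lambda\| \le L_0), \qquad \forall \lambda \in \mathbb{R}^M,$$

where $C_0 = C_0(\tau, M, L_0)$ is a normalizing constant such that $q$ integrates to 1, and
$\|\lambda\|$ stands for the Euclidean norm of $\lambda \in \mathbb{R}^M$.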

In this section we choose the prior $\pi$ in the definition of $\hat f_n$ as a distribution
on $\mathbb{R}^M$ with the Lebesgue density $q$: $\pi(d\lambda) = q(\lambda)\,d\lambda$. We will call
it the sparsity prior.
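To get a feel for the sparsity prior, here is a minimal Python sketch of the one-dimensional density $q_0$ from (21), read here as $q_0(t) = 3/(2(1+|t|)^4)$, together with its distribution function and an inverse-transform sampler. The function names are hypothetical, the closed forms for the distribution function and quantile are derived by direct integration rather than taken from the paper, and the sampler covers only the case $L_0 = \infty$, where the coordinates are independent.

```python
import random


def q0(t):
    """Heavy-tailed density q_0(t) = 3 / (2 (1 + |t|)^4), as read from (21)."""
    return 1.5 / (1.0 + abs(t)) ** 4


def q0_cdf(t):
    """Distribution function of q_0, obtained by integrating the density."""
    tail = 0.5 / (1.0 + abs(t)) ** 3      # mass beyond |t| on one side
    return 1.0 - tail if t >= 0 else tail


def q0_quantile(u):
    """Inverse of q0_cdf on the open interval (0, 1)."""
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    if u >= 0.5:
        return (2.0 * (1.0 - u)) ** (-1.0 / 3.0) - 1.0
    return 1.0 - (2.0 * u) ** (-1.0 / 3.0)


def sample_sparsity_prior(M, tau, rng=random):
    """Draw lambda from the product prior with L_0 = infinity: lambda_j = tau * T_j."""
    draws = []
    for _ in range(M):
        u = rng.random()
        while u == 0.0:                   # random() may return 0.0; redraw
            u = rng.random()
        draws.append(tau * q0_quantile(u))
    return draws
```

With a small scale `tau`, most coordinates of a draw are close to zero while a few can be large, which is the qualitative behaviour one expects from a prior favouring sparse vectors.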

Let us now discuss this choice of the prior. Assume for simplicity that $L_0 = \infty$
which implies $C_0 = 1$. Then the aggregate $\hat f_n$ based on the sparsity prior can be
written in the form $\hat f_n = f_{\hat\lambda}$, where
$\hat\lambda = (\hat\lambda_1, \dots, \hat\lambda_M)$ is the posterior mean in the
"phantom" parametric model (1):

$$\hat\lambda_j = \int_{\mathbb{R}^M} \lambda_j\, \theta_n(\lambda)\, d\lambda, \qquad j = 1, \dots, M,$$

with the posterior density

(23)  $$\theta_n(\lambda) = C \exp\big\{-n\|Y - f_\lambda\|_n^2/\beta + \log q(\lambda)\big\} = C' \exp\Big\{-n\|Y - f_\lambda\|_n^2/\beta - 4\sum_{j=1}^M \log(1 + |\lambda_j|/\tau)\Big\}.$$

Here $C > 0$, $C' > 0$ are normalizing constants, such that $\theta_n(\cdot)$ integrates
to 1. To compare our estimator with those based on the penalized least squares approach
(BIC, Lasso, bridge), we consider now the posterior mode $\tilde\lambda$ of
$\theta_n(\cdot)$ (the MAP estimator) instead of the posterior mean $\hat\lambda$. It is easy
to see that $\tilde\lambda$ is also a penalized least squares estimator. In fact, it follows
from (23) that the MAP estimator is a solution of the minimization problem

(24)  $$\tilde\lambda = \arg\min_{\lambda\in\mathbb{R}^M} \Big\{\|Y - f_\lambda\|_n^2 + \frac{4\beta}{n} \sum_{j=1}^M \log(1 + |\lambda_j|/\tau)\Big\}.$$

Thus, the MAP "approximation" of our estimator suggests that it can be heuristically
associated with the penalty which is logarithmic in $\lambda_j$. In the sequel, we
will choose $\tau$ very small (cf. Theorems 5 and 6 below). For such values of $\tau$ the
function $\lambda_j \mapsto \log(1 + |\lambda_j|/\tau)$ is very steep near the origin and can be
viewed as a reasonable approximation for the BIC penalty function
$\lambda_j \mapsto 1(\lambda_j \ne 0)$. The penalty $\log(1 + |\lambda_j|/\tau)$ is not convex in
$\lambda_j$, so that the computation of the MAP estimator (24) is problematic, similarly to
that of the BIC estimator. On the other hand, our posterior mean $\hat f_n$ is efficiently
computable. Thus, the aggregate $\hat f_n$ with the sparsity prior can be viewed as a
computationally feasible approximation to the logarithmically penalized least squares
estimator or to the closely related BIC estimator. Interestingly, the results that we
obtain below for the estimator $\hat f_n$ are valid under weaker conditions than the
analogous results for the Lasso and Dantzig selector proved in [3, 7] and are sharper
than those for the BIC [6] since we get oracle inequalities with leading constant 1
that are not available for the BIC.
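The comparison between the logarithmic penalty and the BIC penalty can be made concrete. The following sketch evaluates the criterion minimised in (24) for a given weight vector. It is an illustration only: the function names are hypothetical, and the dictionary is assumed to be supplied as a plain list of rows with `design[i][j]` standing for $\phi_j(x_i)$.

```python
import math


def log_penalty(lam, tau):
    """Sum over j of log(1 + |lambda_j| / tau), the penalty appearing in (24)."""
    return sum(math.log1p(abs(l) / tau) for l in lam)


def bic_penalty(lam):
    """Number of non-zero coordinates M(lambda), i.e. the BIC-type penalty."""
    return sum(1 for l in lam if l != 0)


def empirical_sq_norm(values):
    """Squared empirical norm ||v||_n^2 = (1/n) * sum of squares."""
    return sum(v * v for v in values) / len(values)


def map_objective(lam, Y, design, beta, tau):
    """Criterion minimised by the MAP estimator in (24).

    design[i][j] holds phi_j(x_i); Y[i] is the i-th response.
    """
    n = len(Y)
    fitted = [sum(l * phi for l, phi in zip(lam, row)) for row in design]
    residuals = [y - f for y, f in zip(Y, fitted)]
    return empirical_sq_norm(residuals) + 4.0 * beta / n * log_penalty(lam, tau)
```

Zero coordinates contribute nothing to `log_penalty`, while for small `tau` every non-zero coordinate pays a large and slowly varying price, which is the sense in which the logarithmic penalty mimics the count returned by `bic_penalty`.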

Note that if we redefine $q_0$ as the double exponential density, the corresponding
MAP estimator is nothing but the penalized least squares estimator with the Lasso
penalty $\sim \sum_{j=1}^M |\lambda_j|$. More generally, if $q_0(t) \sim \exp(-|t|^\gamma)$ for
some $0 < \gamma < 2$, the corresponding MAP solution is a bridge regression estimator,
i.e., the penalized least squares estimator with penalty
$\sim \sum_{j=1}^M |\lambda_j|^\gamma$ [15]. The argument that we develop below can be easily
adapted for these priors, but the resulting SOI are not as accurate as those that we
obtain in Theorems 5 and 6 for the sparsity prior (21), (22). The reason is that the
remainder term of the SOI is logarithmic in $\lambda_j$ when the sparsity prior is used,
whereas it increases polynomially in $\lambda_j$ for the above mentioned priors.

We first prove a theorem that provides a general tool to derive the SOI from the
PAC-Bayesian bound (8). Then we will use it to get the SOI in more particular
contexts. Note that in this general theorem $\hat f_n$ is not necessarily an exponentially
weighted aggregate defined by (2). It can be any $\hat f_n$ satisfying (8). The result of
the theorem obviously extends to the case where a remainder term such as $R_n$
(cf. (17)) is added to the basic PAC-Bayesian bound (8).

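Theorem 5. Let $\hat f_n$ satisfy (8) with $\pi(d\lambda) = q(\lambda)\,d\lambda$ and
$\tau \le \delta L_0/\sqrt{M}$ where $0 < L_0 \le \infty$, $0 < \delta < 1$. Assume that $\Lambda$
contains the ball $\{\lambda \in \mathbb{R}^M : \|\lambda\| \le L_0\}$.
Then for all $\lambda^*$ such that $\|\lambda^*\| \le (1 - \delta)L_0$ we have

$$E\big(\|\hat f_n - f\|_n^2\big) \le \|f_{\lambda^*} - f\|_n^2 + \frac{4\beta}{n} \sum_{j\in J(\lambda^*)} \log(1 + \tau^{-1}|\lambda_j^*|) + R(M, \tau, L_0, \delta),$$

where the residual term is

$$R(M, \tau, L_0, \delta) = \tau^2 e^{2\tau^3 M^{5/2}(\delta L_0)^{-3}} \sum_{j=1}^M \|\phi_j\|_n^2 + \frac{2\beta\tau^3 M^{5/2}}{n\delta^3 L_0^3}$$

for $L_0 < \infty$ and $R(M, \tau, \infty, \delta) = \tau^2 \sum_{j=1}^M \|\phi_j\|_n^2$.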



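Proof. We apply the bound (8) with
$p(d\lambda) = C_{\lambda^*}^{-1} q(\lambda - \lambda^*)\, 1(\|\lambda - \lambda^*\| \le \delta L_0)\, d\lambda$,
where $C_{\lambda^*}$ is the normalizing constant. Using the symmetry of $q$ and the fact that
$f_\lambda - f_{\lambda^*} = f_{\lambda - \lambda^*} = -f_{\lambda^* - \lambda}$ we get

$$\int_\Lambda \langle f_{\lambda^*} - f,\, f_\lambda - f_{\lambda^*} \rangle_n\, p(d\lambda) = C_{\lambda^*}^{-1} \int_{\|w\| \le \delta L_0} \langle f_{\lambda^*} - f,\, f_w \rangle_n\, q(w)\, dw = 0.$$

Therefore $\int_\Lambda \|f_\lambda - f\|_n^2\, p(d\lambda) = \|f_{\lambda^*} - f\|_n^2 + \int_\Lambda \|f_\lambda - f_{\lambda^*}\|_n^2\, p(d\lambda)$.
On the other hand, bounding the indicator $1(\|\lambda - \lambda^*\| \le \delta L_0)$ by one and
using the identities $\int_{\mathbb{R}} q_0(t)\, dt = \int_{\mathbb{R}} t^2 q_0(t)\, dt = 1$, we obtain

$$\int_\Lambda \|f_\lambda - f_{\lambda^*}\|_n^2\, p(d\lambda) \le \frac{1}{C_0 C_{\lambda^*}} \sum_{j=1}^M \|\phi_j\|_n^2 \int_{\mathbb{R}} \frac{w_j^2}{\tau}\, q_0\Big(\frac{w_j}{\tau}\Big)\, dw_j = \frac{\tau^2 \sum_{j=1}^M \|\phi_j\|_n^2}{C_0 C_{\lambda^*}}.$$

Since $1 - x \ge e^{-2x}$ for all $x \in [0, 1/2]$, we get

$$C_0 C_{\lambda^*} = \int_{\|\lambda\| \le \delta L_0} \prod_{j=1}^M \tau^{-1} q_0(\lambda_j/\tau)\, d\lambda \ge \prod_{j=1}^M \int_{|\lambda_j| \le \delta L_0/\sqrt{M}} \tau^{-1} q_0(\lambda_j/\tau)\, d\lambda_j = \Big(\int_0^{\delta L_0/(\tau\sqrt{M})} \frac{3\, dt}{(1+t)^4}\Big)^M$$

$$= \Big(1 - \frac{1}{(1 + \delta L_0 \tau^{-1} M^{-1/2})^3}\Big)^M \ge \exp\Big(-\frac{2M}{(1 + \delta L_0 \tau^{-1} M^{-1/2})^3}\Big) \ge \exp\big(-2\tau^3 M^{5/2} (\delta L_0)^{-3}\big).$$

On the other hand, in view of the inequality
$1 + |\lambda_j/\tau| \le (1 + |\lambda_j^*/\tau|)(1 + |\lambda_j - \lambda_j^*|/\tau)$
the Kullback-Leibler divergence between $p$ and $\pi$ is bounded as follows:

$$\mathcal{K}(p, \pi) = \int_\Lambda \log\Big(\frac{C_{\lambda^*}^{-1} q(\lambda - \lambda^*)}{q(\lambda)}\Big)\, p(d\lambda) \le 4 \sum_{j=1}^M \log(1 + |\tau^{-1}\lambda_j^*|) - \log C_{\lambda^*}.$$

Easy computation yields $C_0 \le 1$. Therefore
$C_{\lambda^*} \ge C_0 C_{\lambda^*} \ge \exp\big(-2\tau^3 M^{5/2}/(\delta L_0)^3\big)$ and
the desired result follows. □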
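The proof relies on two moment identities and one tail computation for the density $q_0$. For convenience, here is a short worked derivation, assuming that (21) reads $q_0(t) = 3/(2(1+|t|)^4)$; the substitution $u = 1 + t$ is our own choice of presentation.

```latex
\begin{align*}
\int_{\mathbb{R}} q_0(t)\,dt
  &= 3\int_0^\infty \frac{dt}{(1+t)^4}
   = 3\Big[-\frac{1}{3(1+t)^{3}}\Big]_0^\infty = 1, \\
\int_{\mathbb{R}} t^2 q_0(t)\,dt
  &= 3\int_1^\infty \frac{(u-1)^2}{u^4}\,du
   = 3\Big[-\frac{1}{u} + \frac{1}{u^2} - \frac{1}{3u^3}\Big]_1^\infty
   = 3\cdot\frac{1}{3} = 1, \\
\int_{|t|\le a} q_0(t)\,dt
  &= 3\int_0^a \frac{dt}{(1+t)^4}
   = 1 - \frac{1}{(1+a)^3}, \qquad a \ge 0 .
\end{align*}
```

The last line, with $a = \delta L_0/(\tau\sqrt{M}) \ge 1$, gives $(1+a)^{-3} \le 1/8$, so the inequality $1 - x \ge e^{-2x}$ on $[0, 1/2]$ applies.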

Inspection of the proof of Theorem 5 shows that our choice of prior density $q_0$
in (21) is not the only possible one. A similar result can be readily obtained when
$q_0(t) \sim |t|^{-3-\delta}$, as $|t| \to \infty$, for any $\delta > 0$. The important point is
that $q_0(t)$ should be symmetric, with finite second moment, and should decrease not
faster than a polynomial, as $|t| \to \infty$.

We now explain how the result of Theorem 5 can be applied to improve the SOI
existing in the literature. In our setup the values $x_1, \dots, x_n$ are deterministic. For
this case, SOI for the BIC, Lasso and Dantzig selector are obtained in [51 [51 |4"CT1 15] .
In those papers the random errors are Gaussian. So, we will also focus on the
Gaussian case, though similar corollaries of Theorem 5 are straightforward to obtain
for other distributions of errors satisfying the assumptions of Sections 3, 4 or 6.

Denote by $\Phi$ the Gram matrix associated to the family $(\phi_j)_{j=1,\dots,M}$, i.e., the
$M \times M$ matrix with entries $\Phi_{j,j'} = n^{-1} \sum_{i=1}^n \phi_j(x_i)\phi_{j'}(x_i)$,
$j, j' \in \{1, \dots, M\}$, and denote by $\mathrm{Tr}(\Phi)$ the trace of $\Phi$. Set
$\log_+ x = \max(\log x, 0)$, $\forall x > 0$.

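Theorem 6. Let $\hat f_n$ be the exponentially weighted aggregate with
$\pi(d\lambda) = q(\lambda)\,d\lambda$ and $L_0 = \infty$. Let $\xi_i$ be i.i.d. Gaussian
$\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$ and assume that
$\beta \ge 4\sigma^2$, $\mathrm{Tr}(\Phi) > 0$. Set $\tau = \sigma/\sqrt{n\,\mathrm{Tr}(\Phi)}$. Then for
all $\lambda^* \in \mathbb{R}^M$ we have

$$E\big(\|\hat f_n - f\|_n^2\big) \le \|f_{\lambda^*} - f\|_n^2 + \frac{4\beta}{n} M(\lambda^*) \Big(1 + \log_+\Big\{\frac{\sqrt{n\,\mathrm{Tr}(\Phi)}}{M(\lambda^*)\sigma}\, |\lambda^*|_1\Big\}\Big) + \frac{\sigma^2}{n},$$

where $|\lambda^*|_1 = \sum_{j=1}^M |\lambda_j^*|$.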

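Proof. To apply Theorem 5 with $L_0 = \infty$, we need to verify that $\hat f_n$ satisfies (8).
This is indeed the case in view of Theorem 1. Thus we have

(25)  $$E\big(\|\hat f_n - f\|_n^2\big) \le \|f_{\lambda^*} - f\|_n^2 + \frac{4\beta}{n} \sum_{j\in J(\lambda^*)} \log(1 + \tau^{-1}|\lambda_j^*|) + \tau^2\, \mathrm{Tr}(\Phi).$$

By Jensen's inequality,
$\sum_{j\in J(\lambda^*)} \log(1 + \tau^{-1}|\lambda_j^*|) \le M(\lambda^*) \log\big(1 + |\lambda^*|_1/(\tau M(\lambda^*))\big)$.
Since $\log\big(1 + |\lambda^*|_1/(\tau M(\lambda^*))\big) \le 1 + \log_+\big(|\lambda^*|_1/(\tau M(\lambda^*))\big)$,
the result of the theorem follows from the choice of $\tau$. □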
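The right-hand side of (25) is simple to evaluate once the dictionary is tabulated on the design points. The sketch below computes $\mathrm{Tr}(\Phi)$ and the bound in (25), together with the cruder form obtained after the Jensen step of the proof. It is an illustration under assumptions: the function names are hypothetical, `design[i][j]` stands for $\phi_j(x_i)$, and the approximation error $\|f_{\lambda^*} - f\|_n^2$ is passed in as a number since $f$ is unknown in practice.

```python
import math


def gram_trace(design):
    """Tr(Phi) = sum_j ||phi_j||_n^2, with design[i][j] = phi_j(x_i)."""
    n = len(design)
    return sum(v * v for row in design for v in row) / n


def soi_bound(approx_error, lam_star, beta, n, tau, trace_phi):
    """Right-hand side of (25): approximation error + sparsity term + tau^2 Tr(Phi)."""
    sparsity = sum(math.log1p(abs(l) / tau) for l in lam_star if l != 0)
    return approx_error + 4.0 * beta / n * sparsity + tau**2 * trace_phi


def soi_bound_after_jensen(approx_error, lam_star, beta, n, tau, trace_phi):
    """Cruder bound from (25) via Jensen and log(1 + x) <= 1 + log_+(x)."""
    support = [l for l in lam_star if l != 0]
    m = len(support)
    if m == 0:
        return approx_error + tau**2 * trace_phi
    l1 = sum(abs(l) for l in support)
    log_plus = max(math.log(l1 / (tau * m)), 0.0)
    return approx_error + 4.0 * beta / n * m * (1.0 + log_plus) + tau**2 * trace_phi
```

Both functions depend on $\lambda^*$ only through its non-zero coordinates, which is the sparsity property expressed by the oracle inequality; the second value is never smaller than the first.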

Theorem 6 establishes a SOI with leading constant 1 and with no assumption
on the dictionary. Of course, for the inequality to be meaningful, we need a mild
condition on the dictionary: $\mathrm{Tr}(\Phi) < \infty$. But this is even weaker than the
standard normalization assumption $\|\phi_j\|_n = 1$, $j = 1, \dots, M$. Note that a BIC type
aggregate also satisfies a SOI similar to that of Theorem 6 with no assumption on
the dictionary (cf. [6]), but with leading constant greater than 1. However, it is
well-known that the BIC is not computationally feasible, unless the dimension $M$
is very small (say, $M = 20$ in the uppermost case), whereas our estimator can be
efficiently computed for much larger $M$.

The oracle inequality of Theorem 6 can be compared with the analogous SOI
obtained for the Lasso and Dantzig selector under deterministic design [SJ [5] .
Similar oracle inequalities for the case of random design $x_1, \dots, x_n$ can be found in
pj [2H [53] . All those results impose heavy restrictions on the dictionary in terms
of the coherence introduced in [16] or other analogous characteristics that limit the
applicability of the corresponding SOI, see the discussion after Corollary 4 below.

We now turn to the problem of high-dimensional parametric linear regression,
i.e., to the particular case of our setting when there exists $\lambda^* \in \mathbb{R}^M$ such that
$f = f_{\lambda^*}$. This is the framework considered in [5, 40] and also covered as an example
in [3]. In these papers it was assumed that the basis functions are normalized:
$\|\phi_j\|_n = 1$, $j = 1, \dots, M$, and that some restrictive assumptions on the eigenvalues
of the matrix $\Phi$ hold. We only impose a very mild condition:
$\|\phi_j\|_n^2 \le \phi_0$, $j = 1, \dots, M$, for some constant $\phi_0 < \infty$.

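Corollary 4. Let $\hat f_n$ be the exponentially weighted aggregate with
$\pi(d\lambda) = q(\lambda)\,d\lambda$ and $L_0 = \infty$. Let $\xi_i$ be i.i.d. Gaussian
$\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$ and assume that
$\beta \ge 4\sigma^2$. Set $\tau = \sigma/\sqrt{nM\phi_0}$. If there exists
$\lambda^* \in \mathbb{R}^M$ such that $f = f_{\lambda^*}$ and $\|\phi_j\|_n^2 \le \phi_0$,
$j = 1, \dots, M$, for some $\phi_0 < \infty$, we have

(26)  $$E\big(\|\hat f_n - f\|_n^2\big) \le \frac{4\beta}{n} M(\lambda^*) \Big(1 + \log_+\Big\{\frac{\sqrt{nM\phi_0}}{M(\lambda^*)\sigma}\, |\lambda^*|_1\Big\}\Big) + \frac{\sigma^2}{n}.$$

The proof is based on the fact that $\mathrm{Tr}(\Phi) = \sum_{j=1}^M \|\phi_j\|_n^2 \le M\phi_0$ in (25).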

Under the assumptions of Corollary 4 the rate of convergence of $\hat f_n$ is of the
order $O(M(\lambda^*)/n)$, up to a logarithmic factor. This illustrates the sparsity property
of the exponentially weighted aggregate $\hat f_n$: if the (unknown) number of non-zero
components $M(\lambda^*)$ of the true parameter vector $\lambda^*$ is much smaller than the
sample size $n$, the estimator $\hat f_n$ is close to the regression function $f$, even when the
nominal dimension $M$ of $\lambda^*$ is much larger than $n$. In other words, $\hat f_n$ achieves
approximately the same performance as the "oracle" ordinary least squares that
knows the set $J(\lambda^*)$ of non-zero components of $\lambda^*$. Note that similar performance
is proved for the Lasso and Dantzig selector jBJ El HOI E] ; however, the risk bounds
analogous to (26) for these methods are of the form
$O\big(M(\lambda^*)(\log M)/(\kappa_{n,M}\, n)\big)$, where $\kappa_{n,M}$ is a "restricted eigenvalue" of
the matrix $\Phi$ which is assumed to be positive (see Ej for a detailed account). This kind
of assumption is violated for many important dictionaries, such as the decision stumps,
cf. [3], and when it is satisfied the eigenvalues $\kappa_{n,M}$ can be rather small. This
indicates that the bounds for the Lasso and Dantzig selector can be quite inaccurate as
compared to (26).

8. Appendix 

Lemma 3. For any $x \in \mathbb{R}$ and any $\alpha_0 > 0$,
$x + \log\big(1 + \frac{1}{\alpha_0}(e^{-x\alpha_0} - 1)\big) \le \frac{x^2\alpha_0}{2}$.

Proof. On the interval $(-\infty, 0]$, the function
$x \mapsto x + \log\big(1 + \frac{1}{\alpha_0}(e^{-x\alpha_0} - 1)\big)$ is
increasing, therefore it is bounded by its value at 0, that is by 0. For positive
values of $x$, we combine the inequalities $e^{-y} \le 1 - y + y^2/2$ (with $y = x\alpha_0$) and
$\log(1 + y) \le y$ (with $y = \frac{1}{\alpha_0}(e^{-x\alpha_0} - 1)$). □

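Lemma 4. For any $\beta \ge s^2/n + 2\sup_{\lambda\in\Lambda} \|f - f_\lambda\|_n^2$ and for every
$\mu' \in \mathcal{P}'_\Lambda$, the function

$$\mu \mapsto \exp\Big(\frac{s^2 \|\bar f_\mu - \bar f_{\mu'}\|_n^2}{n\beta^2} - \frac{\|f - \bar f_\mu\|_n^2}{\beta}\Big)$$

is concave. Here $\bar f_\mu = \int_\Lambda f_\lambda\, \mu(d\lambda)$ denotes the mixture of the
functions $f_\lambda$ with respect to $\mu$.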



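Proof. Consider first the case where $\mathrm{Card}(\Lambda) = m < \infty$. Then every element of
$\mathcal{P}'_\Lambda$ can be viewed as a vector from $\mathbb{R}^m$. Set

$$Q(\mu) = (1 - \gamma)\|f - \bar f_\mu\|_n^2 + 2\gamma \langle f - \bar f_\mu,\, f - \bar f_{\mu'} \rangle_n = (1 - \gamma)\mu^T H_n^T H_n \mu + 2\gamma \mu^T H_n^T H_n \mu',$$

where $\gamma = s^2/(n\beta)$ and $H_n$ is the $n \times m$ matrix with entries
$(f(x_i) - f_\lambda(x_i))/\sqrt{n}$. The statement of the lemma is equivalent to the concavity
of $e^{-Q(\mu)/\beta}$ as a function of $\mu \in \mathcal{P}'_\Lambda$, which holds if and only if the
matrix $\beta\nabla^2 Q(\mu) - \nabla Q(\mu)\nabla Q(\mu)^T$ is positive-semidefinite. Simple
algebra shows that $\nabla^2 Q(\mu) = 2(1 - \gamma)H_n^T H_n$ and
$\nabla Q(\mu) = 2H_n^T[(1 - \gamma)H_n\mu + \gamma H_n\mu']$. Therefore,
$\nabla Q(\mu)\nabla Q(\mu)^T = H_n^T \mathbf{M} H_n$, where
$\mathbf{M} = 4H_n\tilde\mu\tilde\mu^T H_n^T$ with $\tilde\mu = (1 - \gamma)\mu + \gamma\mu'$. Under
our assumptions, $\beta$ is larger than $s^2/n$, ensuring thus that
$\tilde\mu \in \mathcal{P}'_\Lambda$. Clearly, $\mathbf{M}$ is a symmetric and positive-semidefinite
matrix. Moreover,

$$\lambda_{\max}(\mathbf{M}) \le \mathrm{Tr}(\mathbf{M}) = 4\|H_n\tilde\mu\|^2 = \frac{4}{n} \sum_{i=1}^n \Big(\sum_{\lambda\in\Lambda} \tilde\mu_\lambda \big(f(x_i) - f_\lambda(x_i)\big)\Big)^2$$

$$\le \frac{4}{n} \sum_{i=1}^n \sum_{\lambda\in\Lambda} \tilde\mu_\lambda \big(f(x_i) - f_\lambda(x_i)\big)^2 = 4\sum_{\lambda\in\Lambda} \tilde\mu_\lambda \|f - f_\lambda\|_n^2 \le 4\max_{\lambda\in\Lambda} \|f - f_\lambda\|_n^2,$$

where $\lambda_{\max}(\mathbf{M})$ is the largest eigenvalue of $\mathbf{M}$ and
$\mathrm{Tr}(\mathbf{M})$ is its trace. This estimate yields the matrix inequality

$$\nabla Q(\mu)\nabla Q(\mu)^T \le 4\max_{\lambda\in\Lambda} \|f - f_\lambda\|_n^2\; H_n^T H_n .$$

Hence, the function $e^{-Q(\mu)/\beta}$ is concave as soon as
$4\max_{\lambda\in\Lambda} \|f - f_\lambda\|_n^2 \le 2\beta(1 - \gamma)$.
The last inequality holds for every
$\beta \ge n^{-1}s^2 + 2\max_{\lambda\in\Lambda} \|f - f_\lambda\|_n^2$.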

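The general case can be reduced to the case of finite $\Lambda$ as follows. The concavity
of the functional
$G(\mu) = \exp\Big(\frac{s^2\|\bar f_\mu - \bar f_{\mu'}\|_n^2}{n\beta^2} - \frac{\|f - \bar f_\mu\|_n^2}{\beta}\Big)$
is equivalent to the validity of the inequality

(27)  $$G\Big(\frac{\mu + \tilde\mu}{2}\Big) \ge \frac{G(\mu) + G(\tilde\mu)}{2}, \qquad \forall \mu, \tilde\mu \in \mathcal{P}'_\Lambda .$$

Fix now arbitrary $\mu, \tilde\mu \in \mathcal{P}'_\Lambda$. Take $\tilde\Lambda = \{1, 2, 3\}$ and consider
the set of functions
$\{\tilde f_\lambda, \lambda \in \tilde\Lambda\} = \{\bar f_\mu, \bar f_{\tilde\mu}, \bar f_{\mu'}\}$.
Since $\tilde\Lambda$ is finite, $\mathcal{P}'_{\tilde\Lambda} = \mathcal{P}_{\tilde\Lambda}$. According to
the first part of the proof, the functional

$$\tilde G(\nu) = \exp\Big(\frac{s^2\|\bar f_{\mu'} - \tilde f_\nu\|_n^2}{n\beta^2} - \frac{\|f - \tilde f_\nu\|_n^2}{\beta}\Big), \qquad \nu \in \mathcal{P}_{\tilde\Lambda},$$

where $\tilde f_\nu = \sum_{j=1}^3 \nu(\lambda = j)\,\tilde f_j$, is concave on
$\mathcal{P}_{\tilde\Lambda}$ as soon as
$\beta \ge s^2/n + 2\max_{\lambda\in\tilde\Lambda} \|f - \tilde f_\lambda\|_n^2$, and therefore for
every $\beta \ge s^2/n + 2\sup_{\lambda\in\Lambda} \|f - f_\lambda\|_n^2$ as well. (Indeed, by
Jensen's inequality for any measure $\mu \in \mathcal{P}'_\Lambda$ we have
$\|f - \bar f_\mu\|_n^2 \le \int_\Lambda \|f - f_\lambda\|_n^2\, \mu(d\lambda) \le \sup_{\lambda\in\Lambda} \|f - f_\lambda\|_n^2$.)
This leads to

$$\tilde G\Big(\frac{\nu + \tilde\nu}{2}\Big) \ge \frac{\tilde G(\nu) + \tilde G(\tilde\nu)}{2}, \qquad \forall \nu, \tilde\nu \in \mathcal{P}_{\tilde\Lambda} .$$

Taking here the Dirac measures $\nu$ and $\tilde\nu$ defined by $\nu(\lambda = j) = 1(j = 1)$ and
$\tilde\nu(\lambda = j) = 1(j = 2)$, $j = 1, 2, 3$, we arrive at (27). This completes the proof
of the lemma. □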